Deep Learning Mini-Challenge 2: Image Captioning

Task description: The aim of this mini-challenge is to train an image captioning model. After training, the model should be able to receive an image and generate a single sentence describing the captured scene. This work is strongly inspired by the paper by Vinyals et al.: Show and Tell: A Neural Image Caption Generator (https://arxiv.org/pdf/1411.4555.pdf).

Description of the dataset: The Flickr8k dataset is used for training the models. It consists of 8091 images with varying resolutions. The images were collected from six different Flickr groups and were manually selected to include a range of scenes and situations. In addition, 5 captions are included for each image, resulting in a total of 40455 captions.

Import data

Explorative data analysis

Visualization of Images with their corresponding captions.

Description: We see the first six images from the dataset with their corresponding captions. The images have varying resolutions, and the scenes contain people or animals performing a simple action. The captions seem relatively clean at first glance; however, there are differences in capitalization for individual words. In general, some editing will be necessary for both the images and the captions, but the effort will probably not be too high.

Average caption lengths

Description: The captions in the dataset are at most 38 words long. A maximum length of 19 words would already cover over 95 percent of the captions.

Image resolutions

Description: The visualization shows the resolution of the images as the number of pixels in height and width. As already seen in the visualized samples, the images vary in resolution; however, a clear upper bound of 500 pixels in both height and width is noticeable across all images. For the CNN, all images must have the same resolution in our case, so the images are processed in a next step.

Preprocessing

Preprocessing Images

This section handles the preprocessing of the images, which includes the following transformations:

Preprocessing Captions

In this section, the captions for the images are preprocessed. The captions are originally provided as strings. In a first step they are tokenized using the basic_english tokenizer included in the torchtext library, which performs several operations such as lowercasing and replacing certain symbols using a pattern dictionary. We also limit the maximum number of words per caption to 20, since over 95 percent of all captions are within this range. Captions with fewer than 20 words are padded using the <pad> token. Finally, we mark the beginning and end with the <bos> and <eos> tokens, giving all captions a fixed length of 22 tokens.
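The truncate/pad/mark steps can be sketched as follows. The `simple_tokenize` helper is a dependency-free stand-in for torchtext's basic_english tokenizer, not the actual implementation:

```python
import re

MAX_WORDS = 20  # covers over 95 percent of the captions

def simple_tokenize(caption):
    """Rough stand-in for torchtext's basic_english tokenizer:
    lowercase, separate punctuation, split on whitespace."""
    caption = caption.lower()
    caption = re.sub(r"([.,!?'])", r" \1 ", caption)
    return caption.split()

def preprocess_caption(caption):
    tokens = simple_tokenize(caption)[:MAX_WORDS]      # truncate to 20 words
    tokens += ["<pad>"] * (MAX_WORDS - len(tokens))    # pad short captions
    return ["<bos>"] + tokens + ["<eos>"]              # fixed length of 22
```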

Define Embedding

Our network does not use the actual words from the captions but takes features from an embedding as a vector representation. The aim of the embedding is to map the high-dimensional word space into a meaningful vector space of significantly smaller dimensionality, which makes training more efficient. On one hand, it is necessary to define the vocabulary; on the other hand, an embedding must be created for the individual words of this vocabulary. There are two ways to do this. The first is to train our own embedding. The implementation is not too complicated, but since the embedding must also be trained, I would expect the learning curve of the network to be slower, and it could prevent the model from reaching the full potential of its predictions. The second option is to use pre-trained word embedding vectors, for example GloVe. For this project we want to train a model with and without pre-trained embedding vectors and compare the training afterwards.

Below we use the torchtext Vocab class to generate an integer-encoded vocabulary from the captions and read out the corresponding vectors from GloVe. In addition, the vocabulary is supplemented with our four special tokens \<bos>, \<eos>, \<pad> and \<unk>.
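The core of the vocabulary construction can be sketched without torchtext as below; the notebook itself uses the torchtext Vocab class and looks up the pre-trained vectors via its GloVe loader, so this is only an illustrative stand-in. Placing \<unk> at index 0 mirrors torchtext's default, but is an assumption here:

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<bos>", "<eos>"]  # <unk> first, as in torchtext

def build_vocab(tokenized_captions, min_freq=1):
    """Assign an integer index to every token occurring at least
    min_freq times; the special tokens are prepended."""
    counts = Counter(tok for cap in tokenized_captions for tok in cap)
    words = sorted(w for w, c in counts.items()
                   if c >= min_freq and w not in SPECIALS)
    itos = SPECIALS + words                      # index -> token
    stoi = {w: i for i, w in enumerate(itos)}    # token -> index
    return stoi, itos
```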

Description: We see that the raw GloVe embedding contains a vocabulary of 400,000 tokens. Our reduced vocabulary, on the other hand, has 4094 words, barely 1 percent of that. For these words we now have the corresponding pre-trained embedding vectors in vocab.vectors, which can be passed to the embedding layer of our later model.

Encoding of the dataframe

Description: We see an example of how the encoding works. The special tokens \<bos> and \<eos> are correctly encoded and decoded again. Since "Anton" and ":)" do not occur in our generated vocabulary, they are encoded with the unknown token \<unk>. So the encoding and decoding work as intended. In a next step, we therefore apply the encoding to the entire dataset.
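The described behavior, including the fallback to \<unk> for out-of-vocabulary words, can be sketched with a pair of helper functions; the function names are placeholders, not the notebook's actual API:

```python
def encode(tokens, stoi):
    """Map tokens to vocabulary indices; out-of-vocabulary words
    fall back to the <unk> index."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

def decode(indices, itos):
    """Map indices back to their tokens."""
    return [itos[i] for i in indices]
```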

Train-test split

For training, we divide the dataset into a training set and a test set with a ratio of 4/1. It is important to ensure that all captions of an image end up in the same subset.
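One way to guarantee this is to split on unique image ids rather than on caption rows; the following sketch assumes captions are keyed by an image id, which is how Flickr8k is usually organized:

```python
import random

def split_by_image(image_ids, test_fraction=0.2, seed=42):
    """Split the unique image ids 4:1 so that all five captions of an
    image end up in the same subset."""
    unique_ids = sorted(set(image_ids))
    rng = random.Random(seed)        # fixed seed for a reproducible split
    rng.shuffle(unique_ids)
    n_test = int(len(unique_ids) * test_fraction)
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids
```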

Create train and test set

Define the dataloader

Model Structure

The model consists of two main components: The image is fed into a deep Convolutional Neural Network (CNN), the EncoderCNN, which generates a vector representation extracted from its last hidden layer. The resulting vector is then used as the feature input of the DecoderRNN, which contains a Long Short-Term Memory (LSTM) network to generate the sentence structure.

CNN

For the CNN, we use the ResNet18 model from PyTorch's torchvision library, which has been pre-trained on the ImageNet dataset for an image classification task with 1000 classes. In general, any other CNN architecture could also be used for this task. To use the network for our captioning task, the last hidden layer has to be manually replaced and trained using transfer learning. The output of this linear layer has to match the dimensions of the embedding vectors for the subsequent LSTM network.

LSTM

The LSTM is now used to decode the feature vector. During training, it receives as input a PackedSequence consisting of the concatenation of the feature vector and the caption embeddings. This enables the network to receive inputs of varying lengths. The output of the LSTM is then transformed back into the vocab_size dimension by an additional linear layer. In this way, the linear layer serves as a reverse encoding, where the output represents the weights for the assignment to the words in our vocabulary.

Due to the additional "packing" of the labels using PyTorch's pack_padded_sequence method, we gain the advantage that the \<pad> tokens are not included in the cost function when calculating the cross-entropy loss.
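A decoder along these lines could look as follows; the layer sizes and the exact way the image feature is prepended as the first time step are assumptions modeled on common show-and-tell implementations, not the notebook's verbatim code:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class DecoderRNN(nn.Module):
    """LSTM decoder: image feature as first input step, then the
    embedded caption tokens; a linear layer maps back to vocab_size."""

    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.linear = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions, lengths):
        embeddings = self.embed(captions)                    # (B, T, E)
        # Prepend the image feature vector as the first "token".
        inputs = torch.cat([features.unsqueeze(1), embeddings], dim=1)
        # Packing drops the <pad> positions from the computation.
        packed = pack_padded_sequence(inputs, lengths, batch_first=True,
                                      enforce_sorted=False)
        hidden, _ = self.lstm(packed)
        return self.linear(hidden.data)      # (sum(lengths), vocab_size)
```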

During training, the LSTM receives as input the tokens from the original captions; its own output is not fed back in at this stage. If instead the generated output from the last iteration were used as input, the predictions of the LSTM in the later iterations of a sequence would depend very strongly on the previously generated output. The network would train on assumptions that are often incorrect, especially at the beginning of training, which would slow the training down. This strategy is known as teacher forcing.

When predicting with the trained network, no captions are available, so the highest-probability token from the output of the previous iteration is used as input for the next iteration of the LSTM.
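This greedy decoding loop can be sketched as below. The function operates on the decoder's layers directly; the names `embed`, `lstm` and `linear` and the stand-alone function form are placeholders for what the notebook implements as a method on the model:

```python
import torch

def greedy_caption(features, embed, lstm, linear, eos_idx, max_len=22):
    """Greedy decoding sketch: at every step, feed the
    highest-probability token back in as the next input."""
    inputs = features.unsqueeze(1)    # (1, 1, embed_size): image as first input
    states = None
    tokens = []
    for _ in range(max_len):
        hidden, states = lstm(inputs, states)
        logits = linear(hidden.squeeze(1))         # (1, vocab_size)
        predicted = logits.argmax(dim=1)           # most likely next token
        if predicted.item() == eos_idx:            # stop at <eos>
            break
        tokens.append(predicted.item())
        inputs = embed(predicted).unsqueeze(1)     # feed the prediction back
    return tokens
```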

Combination

Both model classes are integrated into a single CNNtoRNN class. This is primarily for structural reasons and allows the functionalities of both models to be called through one combined class structure. Additionally, the function for captioning a single image is integrated here.

Define models

Training

Both models learn for 200 epochs on the training dataset. The loss is calculated using the cross-entropy loss, and the optimization is done with the Adam optimizer from PyTorch.
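One epoch of this training could look roughly as follows. The sketch assumes the combined model takes (images, captions, lengths) and returns packed logits as described above, and that targets are packed the same way so \<pad> positions drop out of the loss; the learning rate and helper name are assumptions:

```python
import torch
from torch.nn.utils.rnn import pack_padded_sequence

def train_one_epoch(model, loader, optimizer, criterion):
    """One pass over the training loader, returning the summed loss."""
    total_loss = 0.0
    for images, captions, lengths in loader:
        optimizer.zero_grad()
        outputs = model(images, captions, lengths)  # (sum(lengths), vocab_size)
        # Pack the labels the same way, so <pad> steps are excluded.
        targets = pack_padded_sequence(captions, lengths, batch_first=True,
                                       enforce_sorted=False).data
        loss = criterion(outputs, targets)          # cross-entropy on packed steps
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss
```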

Evaluation

BLEU Score

For the evaluation of the captions, the BLEU score is used. BLEU stands for "Bilingual Evaluation Understudy" and is currently one of the most commonly used metrics for comparing machine-generated and human natural language. The score varies between 0 and 1, where 0 means no match and 1 means a perfect match between the generated captions and the test captions.

For the calculation, n-grams are first formed from the compared sentences, usually 1-grams up to 4-grams. Then the precision between the generated sentences and the test sentences is calculated for each n-gram order, and the geometric mean of these precision scores is taken. The resulting geometric average precision is then multiplied by the brevity penalty, which penalizes candidate sentences that are shorter than the references.
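This computation is available off the shelf, for instance in nltk; the sketch below wraps it for one candidate caption against the five references. The smoothing function is an addition on my part to avoid degenerate zero scores on very short captions, and may differ from what the notebook actually uses:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def caption_bleu(references, candidate):
    """BLEU with equally weighted 1- to 4-grams; nltk computes the
    geometric mean and applies the brevity penalty internally.
    `references` is a list of token lists, `candidate` one token list."""
    smooth = SmoothingFunction().method1   # assumption: avoids zero scores
    return sentence_bleu(references, candidate,
                         weights=(0.25, 0.25, 0.25, 0.25),
                         smoothing_function=smooth)
```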

One of the biggest advantages of the BLEU score is that it is relatively uncomplicated and quick to calculate. A possible issue is that the BLEU score cannot interpret the context of the generated text. For example, the metric cannot recognize synonyms, which is why the final test score also depends on how good and extensive the predefined captions in the test dataset are.

Visualizing Loss of the train set

During training, the summed cross-entropy loss per epoch was recorded. We want to investigate whether there was a difference between the two variants (with and without pre-trained embedding) in terms of the training loss.

Description: The visualization shows the training loss for the two models: pretrained_embedding and untrained_embedding. We see that both models optimize the loss function along a smooth learning path, and contrary to my expectations there is apparently no difference in the development of the training loss between the two models over all epochs.

Visualizing the average BLEU of 128 randomized samples in the test set

In addition, 128 images were drawn from the test dataset after each training epoch. A caption was generated for each of these and compared with the five reference captions using the BLEU score. The average BLEU score per epoch for the two model variants (with and without pre-trained embedding) is shown below. For the calculation of the BLEU score, an n-gram range of 1 to 4 was used.

Description: When looking at the average BLEU scores of the two models, we see that the average value varies greatly across epochs for both model variants. Since the drawn samples represent only a small part of the total test dataset, the variation indicates that there are larger differences in the correctness of the captions, and the value therefore still depends strongly on which samples were drawn. In a later run, the number of drawn samples could be increased to obtain a more stable curve. Overall, we see that the BLEU score increases relatively strongly, especially at the beginning, as the training loss decreases. In general, I again cannot see any difference between the pre-trained and untrained embedding.

Comment: Since both models seem to perform virtually identically, it does not make sense for me to analyze both models in more detail. The further analysis therefore focuses on the model with the untrained embeddings.

Comparison of BLEU scores for train and test set

Below, the distributions of the BLEU scores for the training and test set are analyzed. As a small adaptation to the previous calculation of the BLEU score, the n-gram range was limited here to 2. The average BLEU score therefore appears higher than in the time series before.

Description: The histogram shows the distribution of the achieved BLEU scores for all samples from the training set. We see that the BLEU scores are spread over the entire value range with a concentration around 0.32. There are several captions with very good BLEU scores greater than 0.8, but also many samples that achieved a BLEU score of 0. Next, we compare this with the BLEU scores of the test set.

Description: The histogram shows the distribution of the achieved BLEU scores for all samples from the test set. We see that significantly more images received a BLEU score of 0, and very few images reach the range of 0.6 and above. This might be due to several reasons. On the one hand, a slight deterioration of the average BLEU score can be expected due to variance in sentence structures. Furthermore, the test set contains words and objects that did not occur in the training, so the model is sometimes unable to recognize these objects at all.

Visualizing single examples

Now we will take a closer look at a handful of examples with very good and very bad BLEU scores from the test set.

Top captions

Description: Here we see the examples with the best BLEU scores from the test set. The generated captions for these images are very good in my opinion, and I think the high scores are justified. However, at first glance the generated captions seem to be somewhat shorter than the average caption used in training, though this hypothesis would have to be evaluated separately.

Flop captions

Description: At the other end, we see significantly worse performance. There are some images where the model is completely off, as in 3226541300_6c81711e8e.jpg or 782401952_5bc5d3413a.jpg. But there are also images with a BLEU score around 0 that have at least partially reasonable generated captions. For example, in 2925242998_9e0db9b4a2.jpg we see a young woman hiking instead of rock climbing, and in 295258727_eaf75e0887.jpg we see two dogs running over the grass with a forest in the background, which is why the model might have interpreted them as running through the woods.

Conclusion

Overall, I feel good about the result I was able to achieve. The quality of the captions varied from quite decent to completely off target. I think the model is able to recognize single or multiple objects relatively well, since the ResNet18 was pre-trained for exactly this kind of task. However, I have the impression that the LSTM cannot always put the individual objects correctly in relation to each other and mostly just learns a general sentence structure based on memorized patterns. In this context, it would not surprise me if the model could not generate a correct sentence for a previously unseen interaction between known objects. Also, the model generally performs better with objects and scenes that occur frequently in the training dataset. I would therefore argue that the quality of the bad captions could be improved if more similar examples were present in the training dataset.